require(knitr)
## Loading required package: knitr
## Warning: package 'knitr' was built under R version 3.1.3
require(ggplot2)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.1.3
opts_chunk$set(echo=TRUE)

Abstract

My notes of the Statistical Inference at coursera’s Data Science Specialization.

Statistical Inference

Statistical Inference is the process of generating conclusions about a population from a noisy sample.

Week 1

Lecture 1: Probability

Given a random experiment (say, rolling a dice) is a population quantity that summarize the randomness.

Rules of Probability

  • Specifically, probability takes a possible outcome from the experiment and: - assigns it a number between 0 and 1 - so that the probability that something occurs is 1 (the dice must be rolled) and - so the probability of the union of any two sets of outcomes that have nothing in common (mutually exclusive), is the sum of their respective probabilities.

Example:

If A and B cannot both occur: P(A U B) = P(A) + P(B)

Consecuences of the Rules of Probability

  • The probability that nothing occurs is 0
  • The probability that something occurs is 1
  • The probability of something is 1 minus the probability that the opposite occurs
  • The probability of at least one of two things (or more) that cannot occurs at the same time (that are mutually exclusive) is the sum of their respective probabilities
  • If an event A implies the occurrence of event B, then the probability of A occurring is less that the probability of B occurs
  • For any two events the probability that at least one occurs is the sum of their probabilities minus their intersection

Lecture 2: Probability of Mass Functions

**A random variable is a numerical outcome of an experiment. A random variable can be:

PMF (Probability Mass Function)

A PMF is the function that returns the probability that a random discrete variable occurs.

  1. It must always by igual or higher than 0
  2. The sum of the probability off all possible random variables outcomes is equal to 1

Example of a PMF

Flip of a coin X = 0 represents tails and X = 1 represents heads:

    p(x) = (1/2)^x (1/2)^(1-x) for x = 0,1 =>
    p(0) = (1/2)^0 (1/2)^1 = 1/2
    p(1) = (1/2)^1 (1/2)^0 = 1/2
    

But, this is a fair coin. For an unfair coin:

    p(x) = D^x (1 - D)^(1-x) for x= 0,1 =>
    p(0) = D^0 (1 - D)^(1 - 0) = (1 - D)
    p(1) = D^1 (1 - D)^(1 - 1) = D

So, when the coin is fair, D = 0.5 (1/2)

Lecture 3: Probability Density Functions (PDF)

PDF is a function associated to continouns random variable. It must comply with the following rules:

  1. It must be larger or equal to zero everywhere
  2. The total area under it must be equal to one

Example of a PDF

x <- c(-0.5, 0, 1, 1, 1.5)
y <- c(0,0,2,0,0)
plot(x, y, lwd=3, frame=FALSE, type="l")

# This is actually the Beta Probability Density Function. In R we have the function pbeta(p, p1, p2)
# What is the probability that 75% of this population gets draw
pbeta(0.75,2,1) # 2 and 1 are the height and the base of the triangle
## [1] 0.5625
# so the probability is 56.25%

We can see that this can be a PDF. The Area below the plot is equal to one (b x h)/2 = 1x2.0/2 = 1

CDF and survival function

The CDF (Cummulative Distribution Function) of a random variable, X, returns the probability that the random variable is less than or equal to x:

    F(x) = P(X <= x)
    

The Survival function of a random variable X is defined as the probability that the random variable is greater than x:

    S(x) = P(X > x)

So, notice that:

    S(x) = 1 - F(x)

Example

    F(x) = P(X <= x) = 1/2(Base x Height) = 1/2 x (2x) = x^2
    
    Then S(x) = 1 - x^2
    

Quantile

The alpha-th quantile of a distribution with a CDF F is the point x-alpha so that:

    F(x-alpha) = alpha
    
  • A percentile is simply a quantile with the alpha expressed as a percent
  • The median is the 50th percentile

Lecture 4: Conditional Probability

Given the Probability of A when B occurs, the conditional probability of A is:

    P(A|B) = P(A interception B) / P(B)
    

If A are B are unrelated, then P(A|B) = P(A)

Lecture 5: Baye’s Rule

Lecture 6: Indepence

Event A is independence of event B if:

    P(A|B) = P(A) where P(B) > 0
    

also P(A inter B) = P(A) x P(B)

So, you cannot multiply probabilities unless you know the events are independent.

IID random variables

Random variables are said to be IID (Idependent and Identically Distributed):

  • Independent: statistically unrelated from one and another
  • Identically Distributed: all having been drawn from the same population distribution

Lecture 7: Expected Values

Expected Values are values that characterized a population.

Our sample expected values (the sample mean and variance) will estimate the population versions

    E[X] = Sum_x (x p(x))
    
    E[X] is the center of mass of a collection of locations and weights {x, p(x)}
    

Example

Suppose that a die is rolled and X is the number faced up. What is the Expected Value of X?

    E[X] = 1x1/6 + 2x1/6 + 3x1/6 + 4x1/6 + 5x1/6 + 6x1/6 = 3.5
    

Distribution of average of a population

  • The average of a random variable is a random variable itself and it also has a distribution
  • The distribution of averages of a population is centered in the same point as the center of the distribution of the population

Sumarizing

  • Expected values are properties of the distribution (mean, variance, standard deviation)
  • The population mean is the center of mass of the population
  • The sample mean is the center of mass of the observated data
  • The sample mean is an estimate of the population mean
  • The sample mean is unbiased: - The population mean of its distribution is the mean that it is trying to estimate
  • The more data that goes into the sample mean, the more concentrated its density/mass function is around the population mean

Formulas Summary

Week 2

Variance

The standard deviation is the square root of the variance.

In the case of the flip of a coin, the variance is such that:

V(x) = p(1 - p), where P is the probability of one of the faces of the coin

###Sample variance

The sample variance:

  • Has an associate population distribution
  • Its expected value is the same value of the variance of the population it is trying to estimate.

As you collect more and more data, the distribution of the sample will be more concentrated around the population variance it is trying to estimate.

And the square root of the sample variance is the sample standard deviation.

The Simulation Experiment

As I get more and more data, the sample distribution will be more concentrated around the variance of the population that the sample is trying to estimate:

In a simulation what I do is to repeat the experiment as many times as the sample I am selecting and calculate the variance of that repetition experiment. Then what I do is to repeat this set of repetitions many, many times (like thousands of times) and calculate each time the variance of the sample and I get a distribution that is concentrated around the variance of the total population. The larger the sample (the higher the number of repetitions per experiment) the more concentrated is the experiment distribution around the variance of the population.

The following example is the repetition thousand of times of the experiment of roll 10 dices, 20 dices and 30 dices. The distributions in the picture are the distribution of the variance of each repetition of 10, 20 and 30 samples (die rolls). Look that the higher the number of repetitions the more concentrated is the distribution of the resulting experiment:

Simulation Experiments with the Mean

Remember that the average (the mean) of a sample of variables is also a random variable with its own distribution. The sample mean is the same of the mean of the original population:

    E[^X] = u   (^X is the sample of the population, u is the mean of the total population)
    

So, the square root of the variance is the standard deviation. We call it the standard error.

Summary

  1. The sample variance S^2, estimates the population variance (sigma sq)
  2. The distribution of the sample variance S^2 is centered around sigma sq (the variance of the population)
  3. The variance of sample mean is variance of the population devided by n, where n is the number of repetition of the experiment
  4. So the variance of the sample mean is = sigma sq/n
  5. From 4, we can deduct that the logical esptimate of the variance of the sample mean is S^2/n
  6. And the logical estimate of the standard error is S/sq root of n

Conclusions

  • The sample variance estimates the population variance
  • The distribution of the sample variance is centered at what it is estimating
  • It gets more concentrated around what it is estimating as one collects more data (more repetitions per experiment)
  • The variance of the sample mean is equal to the population variance devided by n, where n is the number of repetitions per experiment

Binomial Distribution

Bernoulli Distribution

The Bernoulli distribution arises as the result of a binary outcome, i.e, the flip of a coin, where p is the probability of one of the faces of the coin:

Where (n x) reads “n choose x”, counts the number of ways of selecting x items out of n without replacement disregarding the order of the items.

Example of a binomial distribution

Imagine a couple that has 8 children, 7 out of which are girls and none are twins. If the probability of give birth a boy or a girl is 0.5 (p = 0.5), then the probability of have 7 girls out of 8 children is the following:

    (8 7)*0.5^7(1-0.5)^(8-7) + (8 8)*0.5^8(1 - 0.5)^(8 - 8)
    

In R, this is the code:

choose(8,7)*0.5^7*(1-0.5)^(8-7) + choose(8,8)*.5^8
## [1] 0.03515625
#Also there is a binomial probability function built in in R pbinom#
pbinom(6, size=8, prob=0.5, lower.tail = FALSE)
## [1] 0.03515625

The Normal Distribution

In any normal distribution, the area below the density curve between one and minus one standard deviation covers about 68% of the distribution.

Then the area between -2 and +2 standard deviation is about 95% of the density:

And then, the area between -3 and +3 sd is about 99% of the density:

How to Convert the Normal Distribution to Standard Nomal Distribution and viceversa

If we have X that is a Nomal Distribution, then we can convert it to the Standard Normal Distribution (we call it Z) the following way:

So, 68%, 95%, and 99% of a nomal density lies within 1, 2 and 3 standard deviations from the mean respectively.

Similary, we should remember the following percentiles of the normal distribution:

In general, a typical calculation to remember is:

What is the probability that a N(u, sigma2) RV is larger than x?

The Poisson Distribution

Some uses of the Poisson Distribution:

  • Modeling count data
  • Modeling event-time or survival data
  • Modeling contingency tables
  • Approximating binomials when n is large and p is small

Note that:

  • The mean of a Possion Distribution is equal to lambda
  • The variance is also equal to lambda (l)

Other use of the Possion Distribution is to model rates:

  • X is a Poisson(l*t) where:
  • l = E[X/t] is the expected value of count per unit of time
  • t is the total monitoring time

Example:

The number of people that show up at a bus stop is Poisson with a mean (l) of 2.5 per hour.

If watching the bus stop for 4 hours, what is the probability that 3 or fewer people show up for the whole time?

ppois(3, lambda = 2.5 * 4)
## [1] 0.01033605

Poisson as an aproximation of Binomial

Example, we flip a coin with success probability of 0.01 five hundred times. (so n is large and p is small)

pbinom(2, size = 500, prob = 0.01)
## [1] 0.1233858
ppois(2, lambda = 500 * 0.01)
## [1] 0.124652

Law of Large Numbers

The Law of Large Numbers says that as the sample grows, its mean gets closer and closer to the mean of the whole population.

  • An estimator is consistent if it converges to what your want to estimate
  • The LLN says that the sample mean of iid samples is consistent for the population mean
  • Typically, good estimators are consistent; it’s not too much to ask that if we go to the trouble of collecting an infinite amount of data that we get the right answer.

The Central Limit Theorem

The CLT, states that the distribution of averages of iid variables (properly normilized) becomes that of a standard normal as the sample size increases

Confidence Intervals

R Code example for the Calculation of the Confidence Interval

require(UsingR)
## Loading required package: UsingR
## Warning: package 'UsingR' was built under R version 3.1.3
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.1.3
## Loading required package: HistData
## Loading required package: Hmisc
## Warning: package 'Hmisc' was built under R version 3.1.3
## Loading required package: grid
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.1.3
## Loading required package: survival
## Loading required package: Formula
## Warning: package 'Formula' was built under R version 3.1.3
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
## 
## 
## Attaching package: 'UsingR'
## 
## The following object is masked from 'package:survival':
## 
##     cancer
## 
## The following object is masked from 'package:ggplot2':
## 
##     movies
data(father.son)
x <- father.son$sheight
round((mean(x) + c(-1,1) * qnorm(0.975) * sd(x)/sqrt(length(x)))/12,3)
## [1] 5.710 5.738

Confidence Interval for Binomial Variables

R Function for Binomial Confidence Inverval

# if 56 out of 100 people wants to vote for a candidate, what is the 95% confidence interval 

0.56 + c(-1,1) * qnorm(0.975) * sqrt(0.56 * 0.44/100)
## [1] 0.4627099 0.6572901
# using the built in R function

binom.test(56,100)$conf.int
## [1] 0.4571875 0.6591640
## attr(,"conf.level")
## [1] 0.95

Simulation

n <- 20
pvals <- seq(0.1, 0.9, by = 0.05)
nosim <- 1000
coverage <- sapply(pvals, function(p){
        phats <- rbinom(nosim, prob = p, size = n)/n
        ll <- phats - qnorm(0.975) * sqrt(phats * (1 - phats)/n)
        ul <- phats + qnorm(0.975) * sqrt(phats * (1 - phats)/n)
        mean(ll < p & ul > p)
})
plot(pvals, coverage, type="l")
abline(h=0.95)

LNL and CLT Summary

  • The LLN states averages of iid samples converge to the population means that they are estimating.
  • The CLT that averages are approximately normal, with distributions - centered at the population mean - with standard deviation equal to the standard error of the mean - CLT gives no guarantee that n is large enough

Confidence interval for the mean: one can gets a confidence interval for the mean by taking the mean and subtracting and adding the relevant normal quantile times the standard error. - Adding and subtracting 2 SEs works for 95% intervals

  • The Poisson and Binomial case have exact intervals that don’t require the CLT - But a quick fix for small sample size binomial calculation is to add 2 successes and 2 failures

Additional Thought About the Meaning of Confidence Inverval

From this Wikipedia article, I learned about the actual insterpretation of the Confidence Interval.

For example, for the 95% Confidence Interval, that is very commonly used, it is wrong to say:

  • I have this 95% Confidence Invertal, so there is 95% probability that the population mean \(\mu\) is within this confidence interval.

Instead, what the 95% means is that as we repeat the calculation of the CI for each sample, the population \(\mu\) will be contained within the CI on 95% of the times. That 95% is the Confidence Level.

If \(\hat{X}\) is a sample and the mean of the samples is normally distributed, the 95% CI is calculated as:

\[\hat{X} \pm 1.96\frac{\sigma}{\sqrt{n}}\]

And if X is binomial (i.e coin flip, elections votes yes or no, etc), when p is the probability of one of the options in the population and \(\hat{p}\) is the mean of a sample of X (\(\hat{X}\)), then the CI can be estimated very closely to:

\[\hat{P} \pm \frac{1}{\sqrt{n}}\]

T Confidence Intervals

If you have to choose between use the Z Interval or the T Interval, always use the T interval, because as we collect more data the T interval becomes very aproximate to the Z interval:

\[Est \pm TQ x SE_{Est}\]

The T CI is based in the “T Distribution”, which is not normal. When the n of the sample is small and you use standard normal you get CI that are too narrow, so for low values of n you use the T CI. As n gets larger this distinction becomes irrelevant:

\[\frac{\bar{X} - \mu}{\frac{S}{\sqrt{n}}}\]

This confidence intervals follows Gosset’s t distribution with n-1 degress of freedom.

The T Interval is calculated as follow:

\[\bar{X} \pm t_{n-1}\frac{S}{\sqrt{n}}\] where \(t_{n-1}\) is the relevant T quantile

Some notes about T intervals:

  • The t intervals technically assumes that the data are iid normal
  • It works well whenever the distribution of the data is roughly symmetric and mound shaped
  • Paired observations are often analyzed using the t inteval by taking differences or log of differences
  • For large dress of freedom (n is larger), t intervals become the same as standard normal quantiles; therefore this inteval converges to the same interval as the CLT yielded
  • For skewed distribution (distribution that as one tale longer than the other), the spirit of the t interval assumptions are violated, so:
  • In this kind of distribution does not make sense to center the interval at the mean
  • In this case, consider taking logs or using a different summary like the median
  • Also, for highly discrete data, like binary, is better to use other intervals

Example of t interval with the sleep dataset

Let’s use the sleep dataset of R. This dataset shows the increase in sleeping hours of 10 indivituals after taking certain sleeping medication. We are going to consider that group 1 and 2 are the same group of individuals before and after take the medication.

data(sleep)
head(sleep)
##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6
g <- ggplot(sleep, aes(x = group, y = extra, group = factor(ID)))
g <- g + geom_line(size = 1, aes(colour = ID)) + geom_point(size =10, pch = 21, fill = "salmon", alpha = .5)
g

Now let’s calculate the mean and the sd of the difference between the two groups. We are taking the data as paired, so as if group 1 and 2 are the same subjects before and after take the sleep medication.

g1 <- sleep$extra[1 : 10]; g2 <- sleep$extra[11 : 20]
difference <- g2 - g1
mn <- mean(difference); s <- sd(difference); n <- 10

Now let’s calculate the t interval using the formula and also using the build in R function:

#calculating the interval with the formula
mn + c(-1, 1) * qt(.975, n-1) * s / sqrt(n)
## [1] 0.7001142 2.4598858
#calculating the interval with the R function
t.test(difference)
## 
##  One Sample t-test
## 
## data:  difference
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.7001142 2.4598858
## sample estimates:
## mean of x 
##      1.58
t.test(g2, g1, paired = TRUE)
## 
##  Paired t-test
## 
## data:  g2 and g1
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7001142 2.4598858
## sample estimates:
## mean of the differences 
##                    1.58
#other way to use the R function
t.test(extra ~ I(relevel(group, 2)), paired = TRUE, data = sleep)
## 
##  Paired t-test
## 
## data:  extra by I(relevel(group, 2))
## t = 4.0621, df = 9, p-value = 0.002833
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7001142 2.4598858
## sample estimates:
## mean of the differences 
##                    1.58

How to interpret the result we got in the t interval of the sleep dataset

So the t interval we got from the last code chunk is 0.7 to 2.46

As we calculated the 95% t interval, this means that if we repeat this experiment many times in 95% of the times the mean of the population we are estimating will be in between the interval: 0.7 to 2.46

Confidence intervals for two independent groups

When we need to calculate the CI for two independent groups, for example, we need to measure the blood preasure of two different random groups of subjects, one taking a medication and the other taking a placebo, those groups are independent, so we can’t use the t interval as paired. So the interval calculation is as follow:

Therefore a \((1 - \alpha)\times 100\%\) confidence interval for \(\mu_y - \mu_x\) is \[ \bar Y - \bar X \pm t_{n_x + n_y - 2, 1 - \alpha/2}S_p\left(\frac{1}{n_x} + \frac{1}{n_y}\right)^{1/2} \] The pooled variance estimator is \[S_p^2 = {(n_x - 1) S_x^2 + (n_y - 1) S_y^2}/(n_x + n_y - 2)\] Remember this interval is assuming a constant variance across the two groups

With two independent groups (x and y), the degree of fredom is calculated as:

\[n_{x} + n_{y} - 2\]

Remember that if the variance is not about the same between the two groups, this inteval won’t give you a proper coverage.

Example of confidence interval calculation for two independent groups

Suppose that you can see the effect in the blood preasure of taking a oral contraceptive. We have 8 oral contraceptive users vs 21 controls: - \(\bar{X}_{OC} = 132.86\) mmHg with \(S_{OC} = 15.34\) . The S is the SD - \(\bar{X}_{C} = 127.44\) mmHg with \(S_{C} = 18.23\)

In order to calculate the interval, let’s first calculate the Pooled Variance:

sp <- sqrt(((8-1)*15.34^2 + (21-1)*18.23^2)/(8+21-2))
132.86 - 127.44 + c(-1,1) * qt(0.975,8+21-2)*sp*(1/8 + 1/21)^0.5
## [1] -9.521097 20.361097

So, we got -9.521097 20.361097. This is the 95% confidence interval.

Interpretation of the confidence interval result

In a t interval we are getting the CI of the mean of the differences between the two groups. In the last example, as zero is inside the interval we got we cannot rule out that probably there is not diffence between the average blood preasure between the oral contraceptive users and the control group.

Unequal variances

  • Under unequal variances \[ \bar Y - \bar X \pm t_{df} \times \left(\frac{s_x^2}{n_x} + \frac{s_y^2}{n_y}\right)^{1/2} \] where \(t_{df}\) is calculated with degrees of freedom \[ df= \frac{\left(S_x^2 / n_x + S_y^2/n_y\right)^2} {\left(\frac{S_x^2}{n_x}\right)^2 / (n_x - 1) + \left(\frac{S_y^2}{n_y}\right)^2 / (n_y - 1)} \] will be approximately a 95% interval
  • This works really well
  • So when in doubt, just assume unequal variances

In R, whenever you have or suspect unequal variances, you might want to use the formula:

    t.test(..., var.equal = FALSE)
    

Hypothesis Testing

Truth | Decide| Result

H0| H0| Correctly accept null H0| Ha| Type I error Ha| Ha| Correctly reject null Ha| H0| Type II error

Example

So, a reasonable strategy would reject the null hypothesis if \(\bar{X}\) was larger than some constant, C. Typically, C is chosen so that the probability of a Type I error, \(\alpha\), is 0.05

\(\alpha\) = Type I error rate = Probability of rejecting the null hypothesis when, if fact, the hypothesis is correct.

\(\mu\) = 30 sd = 10 n = 100

\[\frac{10}{\sqrt{100}} = 1 \]

Under \(H_{0}\bar{X} \sim N(30,1)\)

We want to chose C so that the P(\(\bar{X}\) > C;\(H_{0}\)) is 5%

The 95th percentile of a normal distribution is 1.645 sd from the mean

if C = 30 + 1 x 1.645 = 31.645

So, the rule “Reject H0 when ${X} >= 31.645” has the property the probability of rejection is 5% when H0 is true (for the \(\mu_{0}\), \(\sigma\) and n given)

We don’t usually convert the constant C back to the units of the original data. What we do is to calculate the sample mean and see how far it is from the population mean. In the following example, the sample mean \(\bar{X} = 32\), so:

\[\frac{32 - 30}{\frac{10}{\sqrt{100}}}=2\]

And this is greater than 1.645. As we got this sample mean of 2 and we know that the chances that this occur when H0 is true are less than 5%, then in this case we should reject the no Hypothesis (H0).

Or, whenever \[\frac{\sqrt{n}(\bar{X}-\mu_{0})}{s} > Z_{1-\alpha}\] where s is the standard deviation of the population. we reject the H0 hypothesis.

Example for low values of n

Consider that n = 16 (rather than 100)

The statistic

$$

follows a T distribution with 15 (n-1) degrees of freedom under H0.

Under H0, the probability that it is larger than the 95% of the T distribution is 5%

The 95th percentile of the T distribution with 15 df is 1.7531

qt(0.95, 15)
## [1] 1.75305

So, for our example of the mean of the sample being 32, our test statistic is:

$$ = 0.8

As the 95th percentile of the T distribution with 15 df is 1.7531, we can’t reject the H0 in this example (because 0.8 < 1.7531)

P-Values

Example

Suppose that you get a T statistic of 2.5 for 15 df testing \(H_{0} : \mu = \mu_{0}\) vs \(H_{a} : \mu > \mu_{0}\)

  • What is the probability of getting a T statistic as large as 2.5:
pt(2.5, 15, lower.tail = FALSE)
## [1] 0.0122529

So, the probabily is 1.2%. So, either H0 is true and we just got a highly rare T statistic or H0 is false.

Finally, the way you use the P-value is: - If the P-value is less than the \(\alpha\) you reject the null hypothesis.